Abstract

Increasing evidence has indicated that long non-coding RNAs (lncRNAs) are implicated in and associated with many
complex human diseases. Despite the accumulation of lncRNA-disease associations, only a few studies have examined the
roles of these associations in pathogenesis. In this paper, we investigated lncRNA-disease associations from a network view
to understand the contribution of these lncRNAs to complex diseases. Specifically, we studied both the properties of the
diseases in which the lncRNAs were implicated and those of the lncRNAs associated with complex diseases. Because both
protein-coding genes and lncRNAs are involved in human diseases, we constructed a coding-non-coding gene-disease
bipartite network based on known associations between diseases and disease-causing genes. We then applied a
propagation algorithm to uncover the hidden lncRNA-disease associations in this network. The algorithm was evaluated by
leave-one-out cross-validation on 103 diseases in which at least two genes were known to be involved, and achieved an
AUC of 0.7881. Our algorithm predicted 768 potential lncRNA-disease associations between 66 lncRNAs and 193
diseases. Furthermore, our results for Alzheimer's disease, pancreatic cancer, and gastric cancer were verified by other
independent studies.



Citation: Yang X, Gao L, Guo X, Shi X, Wu H, et al. (2014) A Network Based Method for Analysis of lncRNA-Disease Associations and Prediction of lncRNAs
Implicated in Diseases. PLoS ONE 9(1): e87797. doi:10.1371/journal.pone.0087797

Editor: Paolo Provero, University of Turin, Italy 

Received July 4, 2013; Accepted December 31, 2013; Published January 31, 2014 

Copyright: © 2014 Yang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 60933009, 61070137, 91130006, 61303118, & 61303122;
website: http://www.nsfc.gov.cn/publish/portal1/) and the Fundamental Research Funds for the Central Universities (Xidian University, No. K50512230005 &
K50512230003). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist. 

* E-mail: lgao@mail.xidian.edu.cn 



Introduction 

Long non-coding RNAs (lncRNAs) are similar to mRNAs in
gene structure, with length greater than 200 nt [1-3]. LncRNAs
play critical roles in many important biological processes such as
chromatin modification [4], transcriptional and post-transcriptional
regulation [4], and human diseases [2].

A growing number of studies have reported that mutated and
dysfunctional lncRNAs are implicated in a broad range of human
diseases. For example, Pasmant et al. [5] performed a GWAS and
found that ANRIL was significantly associated with coronary
disease, type 2 diabetes, and many types of cancer. HOTAIR
expression, measured by quantitative PCR, was increased 100- to
approximately 2,000-fold in breast cancer metastases [6]. MALAT-1 was
significantly associated with metastasis in NSCLC patients by
quantitative RT-PCR [7]. With regard to Alzheimer's disease, BACE1-AS was
shown to have a key role in regulating BACE1 and in driving
pathology [8]. Cui et al. [9] found that the expression of PlncRNA-1
was significantly higher in prostate cancer cells. Therefore, it is
necessary to analyze the available lncRNA-disease associations
and to predict potential lncRNA-disease associations in humans. Such
studies will help us understand the molecular mechanisms of
human diseases and identify biomarkers for disease diagnosis,
treatment, and prevention at the lncRNA level [10].



Chen et al. [10] reported the LncRNADisease database, which
includes approximately 480 entries of experimentally supported
associations between 166 diseases and 118 lncRNAs. Moreover,
we manually collected 380 lncRNA-disease associations
between 226 lncRNAs and 145 diseases by literature mining. By
integrating these two data sets, we obtained 578 lncRNA-disease
associations between 295 lncRNAs and 214 diseases. These data
were analyzed in a network view and used to predict lncRNA-disease
associations.

In this paper, based on the available lncRNA-disease associations,
a lncRNA-disease association network was constructed.
From the constructed network, two relevant biological networks,
the "lncRNA-implicated disease network" (lncDN) and the
"disease-associated lncRNA network" (DlncN), were derived, as shown in
Figure 1. In lncDN, a node represented a disease, and a link
between two nodes indicated that the two corresponding diseases
shared at least one lncRNA as their disease-causing lncRNA
(Figure 1 and Figure 2-(a)). In DlncN, a node represented a
lncRNA, and a link between two nodes indicated that
the two corresponding lncRNAs were implicated in at least one
common disease (Figure 1 and Figure 2-(b)). The known lncRNA-disease
associations were represented in a single network
framework, and the network topological properties were analyzed
to help us investigate all of these associations. Furthermore, a
propagation algorithm was applied to predict potential lncRNA-disease
associations on the lncRNA-disease association network. In
addition, a coding-non-coding gene-disease bipartite network was
constructed by integrating coding gene-disease associations
obtained from OMIM [11] with lncRNA-disease associations.
To achieve better prediction performance, the propagation
algorithm was applied to rank the potential gene-disease pairs
for all the diseases on the coding-non-coding gene-disease bipartite
network. In the Leave-One-Out Cross-Validation (LOOCV)
procedure, our method achieved an Area Under Curve
(AUC) of 0.7881. We then applied our method to
three multifactorial diseases, Alzheimer's disease, pancreatic
cancer and gastric cancer, and suggested novel
disease-causing lncRNAs for further study.

Materials and Methods 

Data Sources 

The 480 lncRNA-disease associations were downloaded from the
LncRNADisease database [10], covering 118 lncRNAs and 166
diseases. Note that many other lncRNA-disease associations have
been reported in the literature but have not yet been included in the
LncRNADisease database. Hence, we retrieved literature from
PubMed (http://www.ncbi.nlm.nih.gov/pubmed), employing the
key words 'lncRNA and disease', 'lncRNA and cancer', 'long non-coding
RNA and disease', 'long non-coding RNA and cancer',
'lincRNA and disease' or 'lincRNA and cancer', and manually
extracted 129 articles that reported lncRNA-disease associations.
In this way, we collected an additional 380 lncRNA-disease
associations between 226 lncRNAs and 145 diseases by literature
mining. Integrating the data sets from the LncRNADisease
database and the literature search, we finally obtained 578 associations
between 295 lncRNAs and 214 diseases. All of these 578 lncRNA-disease
associations were then merged into a lncRNA-disease
association network.

Of the 214 diseases, 160 diseases and their causative genes could
be found using their MIM numbers in the OMIM database [11]. In
total, we downloaded 801 disease genes for these 160 diseases from the
OMIM database. These data yielded 980 protein-coding gene-disease
associations that were used in our method.

Integrating the lncRNA-disease associations and protein-coding
gene-disease associations obtained above, we obtained 1558
coding-non-coding gene-disease associations between 1096 genes
(295 lncRNAs and 801 protein-coding genes) and 214 diseases.
These associations were used to construct the coding-non-coding
gene-disease bipartite network.



Figure 1. Construction of the lncRNA-disease bipartite network. (Center) A subnetwork of the full lncRNA-disease association network (Figure
S1), where the blue circles and red hexagons correspond to diseases and lncRNAs, respectively. A link is placed between a disease and a lncRNA if
mutations or dysfunctions in that lncRNA lead to the specific disease. The size of a blue circle is proportional to the number of lncRNAs participating
in the corresponding disease. The size of a red hexagon is proportional to the number of diseases associated with the corresponding lncRNA. (Left)
The lncDN projection of the center graph, in which two diseases are connected if there is a lncRNA implicated in both diseases. (Right) The DlncN
projection of the center graph, where two lncRNAs are connected if they are involved in the same disease.
doi:10.1371/journal.pone.0087797.g001

Figure 2. The lncDN and DlncN. Panels: (a) lncRNA-implicated disease network; (b) disease-associated lncRNA network. (a) In lncDN, each node
corresponds to a distinct disease, colored based on the disease class [26] to which it
belongs. The names of 20 disease classes are shown on the right panel. A link between two diseases exists if they share at least one implicated
lncRNA. The size of the node is proportional to the degree of the node in the lncRNA-disease association network. We label the diseases associated with
more than five lncRNAs by their names. (b) In DlncN, each node is a lncRNA, with two lncRNAs being connected if they are implicated in the same
disease. The size of each node is proportional to the number of diseases in which the lncRNA is implicated. The color of a node is based on the class
of diseases in which the corresponding lncRNA is implicated. Nodes are light purple if the corresponding lncRNAs are associated with more than one
disease class. We label the lncRNAs implicated in more than five diseases by their names.
doi:10.1371/journal.pone.0087797.g002

Methods

Given a bipartite network G(X,Y,E), X and Y were two
disjoint node sets, and E was the edge set in which each element
represented an edge connecting a node from X and a node
from Y. The bipartite network could be viewed through its one-mode
projection onto X and its one-mode projection onto Y, called the X
projection and the Y projection, respectively. The X projection of G
was a network in which the nodes were from X, and an edge
indicated that the connected nodes were associated with at least
one common node from Y. Similarly, the Y projection of G was a
network in which the nodes were from Y, and an edge indicated that
the connected nodes were associated with at least one common node
from X. The lncRNA-disease association network
is a bipartite network, and lncDN and DlncN
were its disease projection and lncRNA projection, respectively.
The properties of these two
projections were analyzed in the "Results" section. It was found
that lncDN could reflect the relationships between any two
diseases at the lncRNA level and that DlncN could reflect the
relationships between any two lncRNAs at the disease level.
Moreover, we tried to exploit these relationships to predict the
hidden lncRNA-disease associations. For better performance,
both protein-coding genes and lncRNAs that were implicated in
diseases were considered together. As a result, a coding-non-coding
gene-disease bipartite network was constructed to reflect
the associations between diseases and all the disease-causing
genes (i.e. protein-coding genes or lncRNAs). The resource-allocation
process [12], one of the best weighting methods for the
one-mode projection of a bipartite network, was used to weight
the gene projection of the coding-non-coding gene-disease
bipartite network. A propagation algorithm was then applied on
the weighted gene projection to compute an association score for each gene,
which measured how strongly the gene could be implicated in a disease.
For a disease q, every gene had its
initial information. Our propagation algorithm could be viewed
as a process in which genes pumped their initial information to their
neighbors, and every gene propagated the information received
in the previous iteration to other genes via the edges of the gene
projection.
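The one-mode projections described above can be illustrated with a minimal Python sketch. The function name `one_mode_projection` and the node labels (`lnc1`, `d1`, and so on) are hypothetical and serve only as an illustration; they are not taken from the study's data or code. Nodes that share no neighbor with any other node do not appear in the returned link set.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(edges, side=0):
    """Return the unweighted one-mode projection of a bipartite edge list.

    edges: iterable of (x, y) pairs, e.g. (lncRNA, disease).
    side:  0 projects onto the X nodes, 1 projects onto the Y nodes.
    Two nodes are linked if they share at least one neighbor on the other side.
    """
    members = defaultdict(set)
    for edge in edges:
        # Group the nodes of the chosen side by their neighbor on the other side.
        members[edge[1 - side]].add(edge[side])
    links = set()
    for group in members.values():
        for u, v in combinations(sorted(group), 2):
            links.add((u, v))
    return links


# Hypothetical toy network: three lncRNAs and three diseases.
toy_edges = [("lnc1", "d1"), ("lnc1", "d2"), ("lnc2", "d2"), ("lnc3", "d3")]
lncrna_projection = one_mode_projection(toy_edges, side=0)   # analogous to DlncN
disease_projection = one_mode_projection(toy_edges, side=1)  # analogous to lncDN
```

In this toy case, `lnc1` and `lnc2` are linked because both are attached to `d2`, and `d1` and `d2` are linked because both are attached to `lnc1`.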

Next, we describe the principle of the resource-allocation
process, and then present the propagation algorithm used to compute
the scores of genes with respect to a specific disease.

Principle of the resource-allocation process 

We divided the nodes of a bipartite network into two sets, X and
Y, such that only connections between nodes in different sets
were allowed. The resource-allocation process is one of the best
weighting methods for the one-mode projection of a bipartite network
[12]. This process is illustrated in Figure 3 and includes the
following two steps. First, resources were allocated from X to Y.
Second, resources were allocated from Y back to X. The initial
resources of the five nodes in set X were a, b, c, d and e. The two steps
of the resource-allocation process were merged into one, and the
final resources of the X nodes, denoted by a', b', c', d' and e', could be
written as:

$$ (a', b', c', d', e')^{T} = W \, (a, b, c, d, e)^{T} \qquad (1) $$

Figure 3. Principle of the resource-allocation process in a
bipartite network. The green rectangles represent X nodes and the red
hexagons represent Y nodes. The whole process consists of two steps:
first, the resource flows from X to Y (a->b), and then it returns to X
(b->c).
doi:10.1371/journal.pone.0087797.g003

The 5×5 matrix $W = (w_{ij})$ represented the weighted X projection. The
element $w_{ij}$ represented the fraction of resource that the j-th X
node transferred to the i-th X node, and could be considered as the
importance of node i on node j [12].

For a bipartite network G(X,Y,E) with |X| = n and |Y| = m, where $x_i$ was
the i-th X node and $y_l$ was the l-th Y node, $w_{ij}$ could be
calculated as follows [12]:

$$ w_{ij} = \frac{1}{k(x_j)} \sum_{l=1}^{m} \frac{a_{il}\, a_{jl}}{k(y_l)} \qquad (2) $$

where $1 \le i, j \le n$, $(a_{il})_{n \times m}$ was the n×m adjacency matrix of G(X,Y,E),
$k(x_i)$ was the degree of $x_i$, and $k(y_l)$ was the degree of $y_l$.
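As an illustration of the weighting in Formula (2), the following is a minimal NumPy sketch, assuming the resource-allocation form of [12] in which each weight is normalized by the degree of the source node and by the degrees of the shared Y nodes. The function name and the toy adjacency matrix are hypothetical and are not taken from the study.

```python
import numpy as np


def resource_allocation_weights(a):
    """Weighted X projection of a bipartite network.

    a: n x m 0/1 adjacency matrix (rows are X nodes, columns are Y nodes).
    Returns the n x n matrix W with
        w_ij = (1 / k(x_j)) * sum_l a_il * a_jl / k(y_l).
    """
    a = np.asarray(a, dtype=float)
    kx = a.sum(axis=1)  # degrees of the X nodes
    ky = a.sum(axis=0)  # degrees of the Y nodes
    # Inverse degrees, with zero-degree nodes mapped to 0 to avoid division by zero.
    inv_kx = np.divide(1.0, kx, out=np.zeros_like(kx), where=kx > 0)
    inv_ky = np.divide(1.0, ky, out=np.zeros_like(ky), where=ky > 0)
    # (a * inv_ky) scales column l by 1/k(y_l); the final factor scales column j by 1/k(x_j).
    return ((a * inv_ky) @ a.T) * inv_kx


# Hypothetical toy network: 3 X nodes (genes) and 2 Y nodes (diseases).
toy_a = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])
toy_w = resource_allocation_weights(toy_a)
```

A useful sanity check under this form is that every column of W belonging to a node with at least one edge sums to one, because each node redistributes all of its resource. W is generally not symmetric.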

The propagation algorithm 

The coding-non-coding gene-disease bipartite network was
denoted by A(G,D,E), where G was the node set of the genes, D
was the node set of the diseases, and E was the edge set. The
weighted gene projection of A was denoted by W, where $w_{ij}$ was
calculated by Formula (2) and represented the importance of gene
i on gene j in terms of their association with disease.

Our propagation algorithm was based on a semi-supervised
learning algorithm [13], which had been previously used to
prioritize protein-coding genes implicated in human diseases [14]
and to annotate functions of lncRNAs [15]. The input of the
algorithm included A(G,D,E), a query disease q, and W. For
disease q, every gene had its own initial information. If a gene was
connected with q in the coding-non-coding gene-disease bipartite
network, its initial information was one; otherwise its initial
information was zero. For a given disease q, the score vector f of the
genes represented the association scores of the genes with q and was
computed by an iterative algorithm. The genes were ranked for q
by the final score vector. Of all the genes not associated with
disease q, the top 1% ranked genes were considered as the
predicted genes. The score vector f was defined as:

$$ f = W f \qquad (3) $$

An iterative process [14] was applied to compute the score vector
in Formula (3). Considering the initial information on the genes for
the given disease, the score vector f was computed iteratively as
follows:

$$ f^{t} = \alpha W f^{t-1} + (1 - \alpha) f^{0} \qquad (4) $$

In Formula (4), the score vector was initialized as $f^{0}$ by the initial
information on the genes. The parameter $\alpha \in (0,1)$ gave the relative
importance of the information contributed by other genes
versus a gene's own initial information. The final score vector with
respect to disease q was thus determined by both the information on
other genes and the initial information. The iterative computation
was controlled by the mean square deviation between two consecutive
score vectors. All the tests on the real data and random data
showed that the iterative computation eventually converged (Table
S1).
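The iteration in Formula (4) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name, the `max_iter` safeguard and the toy weight matrix are assumptions, while the default values of `alpha` (0.618) and the stopping tolerance (0.00001) follow the values reported in the "Results" section.

```python
import numpy as np


def propagate(w, f0, alpha=0.618, tol=1e-5, max_iter=1000):
    """Iterate f_t = alpha * W f_{t-1} + (1 - alpha) * f_0.

    w:   n x n weighted gene projection.
    f0:  initial information (1 for genes linked to the query disease, else 0).
    Stops when the mean square deviation between consecutive score vectors
    is not greater than tol, or after max_iter iterations (a safeguard
    added for this sketch).
    """
    w = np.asarray(w, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    f = f0.copy()
    for _ in range(max_iter):
        f_next = alpha * (w @ f) + (1.0 - alpha) * f0
        if np.mean((f_next - f) ** 2) <= tol:
            return f_next
        f = f_next
    return f


# Hypothetical 3-gene weighted projection; gene 0 is the only known disease gene.
toy_w = np.array([[0.5, 0.25, 0.0],
                  [0.5, 0.50, 0.5],
                  [0.0, 0.25, 0.5]])
scores = propagate(toy_w, [1.0, 0.0, 0.0])
ranking = np.argsort(-scores)  # genes ordered from highest to lowest score
```

In this toy example gene 1, which is directly linked to the known disease gene in the projection, receives a higher score than gene 2, which is reachable only through gene 1.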

Results 

Properties of lncRNA-disease association network

The available lncRNA-disease associations were modeled as a
bipartite network, and a subnetwork of this network is shown in
Figure 1. In this bipartite network, one node set corresponded to
the disease set; the other set corresponded to the lncRNA set. A
lncRNA and a disease were connected by a link if the lncRNA was
associated with the disease. The constructed bipartite network
contained 578 edges between 295 lncRNA nodes and 214 disease
nodes.

The degree distribution of the full lncRNA-disease association
network (Figure S1) closely followed a power-law distribution
(Figure S2-(a)). We also analyzed the degrees of the disease nodes and
those of the lncRNA nodes separately. The degree of a disease node,
i.e. the number of lncRNAs associated with the disease,
was denoted by s and had a broad distribution (Figure S2-(b)).
These results indicated that most disorders were associated with a
small number of lncRNAs, whereas a handful of diseases, such as
breast cancer and lung cancer, were related to a large number of
lncRNAs. For example, 41 lncRNAs were involved in breast
cancer (s = 41), 18 lncRNAs were related to prostate cancer
(s = 18), and 28 lncRNAs were involved in lung cancer (s = 28).
The degree of a lncRNA node, i.e. the number of diseases
associated with the lncRNA, was denoted by d and had a broad
distribution as well (Figure S2-(c)). This indicated that many
lncRNAs were related to a few diseases, whereas a small number of
lncRNAs could be related to dozens of diseases. For example,
XIST (d = 50) was associated with 50 diseases, including 40 skin
diseases [16] and certain types of cancer such as testicular cancer
[17] and breast cancer [18]. H19 (d = 39) was associated with 39
diseases, including Beckwith-Wiedemann syndrome [19], Silver-Russell
syndrome [20,21] and many types of cancer [22]. MEG3
(d = 23) was associated with 23 diseases, including breast cancer
[23], bladder cancer [23], glioma [24], and Wilms' tumor [25].
These lncRNAs represented major hubs in DlncN (Figure 2-(b)).

Network analysis of lncDN and DlncN

We performed a network analysis of lncDN and DlncN to help
us understand the lncRNA-disease associations. Two biologically
relevant network projections, lncDN and DlncN, were generated
(Figure 2) based on the lncRNA-disease association network.
Specifically, lncDN provided a disease-centered view of the
lncRNA-disease association network (Figure 2-(a)). DlncN was
complementary to lncDN and offered a lncRNA-centered view of
the lncRNA-disease association network (Figure 2-(b)). In particular,
the links between two lncRNAs in DlncN signified disease
phenotypic associations, which might be a measure of their
functional correlations and could be used in future studies.

Degree distributions of lncDN and DlncN. The degree
distribution of the lncDN was investigated (Figure S3-(a)). The
results showed that most disorders were linked to only a few other
diseases, whereas only a few disorders represented hubs that were
connected to a large number of distinct disorders. Such hub
disorders included breast cancer (linked to 150 other disorders, i.e.
n = 150, where n denotes the degree of a node in lncDN), prostate
cancer (n = 144), and lung cancer (n = 73). The degree distribution
of the DlncN (Figure S3-(b)) was similar to that of lncDN. The
degrees of most lncRNAs were small, whereas a
few lncRNAs were linked to a large number of lncRNAs. For example,
MEG3 was linked to 196 other lncRNAs, ANRIL to 166 other
lncRNAs, and PVT1 to 162 other lncRNAs. These highly
connected lncRNAs represented hubs in DlncN and were connected
to a large number of diseases in the lncRNA-disease association
network. We concluded that the degree distributions of both the
lncDN and DlncN networks closely followed a power-law
distribution, despite the incompleteness and false positive rate
of the known lncRNA-disease associations.

Comparison of lncDN and DlncN with random
networks. In lncDN, there were 3061 links among 214
individual diseases. Of the 214 diseases, 197 had at least one link
to other diseases and 182 diseases formed a giant connected
component. In DlncN, there were 6989 links among 295
lncRNAs. Of the 295 lncRNAs, 276 had at least one link to other
lncRNAs and 265 lncRNAs formed a giant connected component.

We randomly shuffled the lncRNA-disease association network
$10^{4}$ times, while keeping the degree of each lncRNA and each
disease in the bipartite network unchanged [26]. We constructed
the corresponding r-lncDN and r-DlncN for the
disease-centered and lncRNA-centered views of the randomized
lncRNA-disease association network, respectively. Comparing lncDN and DlncN with
r-lncDN and r-DlncN, respectively, we found that the topological
properties of the two generated networks, lncDN and DlncN,
deviated from random. The average size of the giant connected
components of the $10^{4}$ r-lncDNs was 137 ± 6, which was significantly
smaller than 182 (p-value < $10^{-8}$, t-test), the actual size of the
giant connected component of lncDN. Similarly, the average size
of the giant connected components of the $10^{4}$ r-DlncNs was 215 ± 7,
which was significantly smaller than 265 (p-value < $10^{-8}$, t-test),
the actual size of the giant connected component of DlncN.
Considering the disease classes defined in the study of Goh et al.
[26], we found that diseases (lncRNAs) were more likely to be
linked to diseases in the same class in the actual networks. For
example, in the lncDN, there were 806 links between diseases of
the same class, a two-fold enrichment with respect to the 397 ± 47
links obtained between the same set of nodes in the randomized
networks. These differences suggested important pathophysiological
clustering of diseases and disease-associated lncRNAs.
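One common way to shuffle a bipartite network while keeping every node's degree unchanged is to swap the endpoints of randomly chosen edge pairs. The sketch below illustrates that general technique; the text does not state which shuffling procedure was actually used beyond degree preservation, so the function name, the number of swaps and the toy edge list are all assumptions.

```python
import random
from collections import Counter


def shuffle_bipartite(edges, n_swaps, seed=0):
    """Degree-preserving randomization of a bipartite edge list.

    Repeatedly picks two edges (g1, d1) and (g2, d2) and rewires them to
    (g1, d2) and (g2, d1), skipping any swap that would duplicate an edge.
    Every gene and every disease keeps its original degree.
    """
    rng = random.Random(seed)
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (g1, d1), (g2, d2) = edges[i], edges[j]
        new1, new2 = (g1, d2), (g2, d1)
        if new1 in edge_set or new2 in edge_set:
            continue  # also covers the cases g1 == g2 and d1 == d2
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


# Hypothetical toy network; labels are placeholders, not study data.
toy_edges = [("g1", "d1"), ("g1", "d2"), ("g2", "d2"), ("g3", "d3"), ("g3", "d1")]
shuffled = shuffle_bipartite(toy_edges, n_swaps=200, seed=1)
gene_degrees_kept = Counter(g for g, _ in shuffled) == Counter(g for g, _ in toy_edges)
disease_degrees_kept = Counter(d for _, d in shuffled) == Counter(d for _, d in toy_edges)
```

Each shuffled edge list can then be projected onto diseases or lncRNAs to obtain a randomized counterpart of lncDN or DlncN for comparison with the real network.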

Clustering coefficients of lncDN and DlncN. To further
address the topological properties of lncDN and DlncN, we
calculated the average clustering coefficient, a measure of the
tendency of nodes in a network to form clusters or groups [27], using
NetworkAnalyzer [28], a plugin of the Cytoscape software [29]. We
found that the average clustering coefficients of nodes in both
networks tended to decrease as the degree of the node
increased (Figure S4), indicating that nodes with high degrees
tended to be hub nodes in both networks. In addition, we
calculated the clustering coefficient of lncDN as the average of the
clustering coefficients of all the vertices in lncDN [30], and the
clustering coefficients of 550 randomly generated networks with
the same degree sequence as lncDN [31]. The average clustering
coefficient of the randomized networks was 0.45 ± 0.01, which was
significantly smaller than 0.81 (p-value < $10^{-10}$, t-test), the
clustering coefficient of lncDN. Likewise, we generated 550
randomized networks with the same degree sequence as DlncN.
The average clustering coefficient of the 550 randomized networks
was 0.53 ± 0.01, which was significantly smaller than 0.91
(p-value < $10^{-10}$, t-test), the clustering coefficient of DlncN. These
results indicated that lncDN and DlncN had an obvious
community structure. Therefore, in the following section, we
analyze the modules of lncDN and DlncN.
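For readers who want to reproduce the quantity outside Cytoscape, the average clustering coefficient can be computed directly from an adjacency structure. The sketch below uses the standard local clustering definition (the fraction of a node's neighbor pairs that are themselves linked), with nodes of degree below two contributing zero; it is not the NetworkAnalyzer implementation, and the function name and toy graph are hypothetical.

```python
from itertools import combinations


def average_clustering(adj):
    """Average local clustering coefficient of an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    Nodes with fewer than two neighbors contribute a coefficient of 0.
    """
    if not adj:
        return 0.0
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the links that exist among the neighbors of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical toy graph: a triangle (a, b, c) with one pendant node d.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
toy_cc = average_clustering(toy_adj)
```

Here nodes b and c have coefficient 1, node a has 1/3, and node d has 0, so the average is (1/3 + 1 + 1 + 0) / 4.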

Modules of lncDN and DlncN. We clustered the lncDN and
DlncN using MINE (http://apps.cytoscape.org/apps/mine), a plugin
of the Cytoscape software [29]. As a result, we obtained 14 modules of
lncDN and 19 modules of DlncN. The size of each module had a
broad distribution (Figure S5).

Although the lncDN layout was generated without any
knowledge of the disease classes, the resulting network was visibly
clustered according to major disease classes (Figure 2-(a)). For
example, most (seven of 11) diseases that belonged to the cardiovascular
class were associated with ANRIL and were clustered together.
Most (seven of eight) dermatological diseases were associated with
XIST and were also grouped into one cluster. However, some
lncRNAs might be of special importance as they were implicated
in different cancers that were not grouped into a single cluster.
For example, ANRIL was associated with 14 types of cancer and
MEG3 was associated with 18 types of cancer. These observations
suggested the complexity and heterogeneity of different types of
cancer.

In DlncN, lncRNA nodes were colored based on the class of
diseases in which these lncRNAs were implicated. Nodes were
light purple if the corresponding lncRNAs were associated with
more than one disease class (Figure 2-(b)). We found that most
lncRNAs were implicated only in a certain type of cancer, and they
were mostly clustered into one module. For example, 17 lncRNAs
were only related to brain cancer, 22 lncRNAs were only related
to breast cancer, and 84 lncRNAs were only related to glioma.
However, the major hubs were related to more than one disease
class: XIST was related to 12 disease classes, H19 to seven disease
classes, ANRIL to six disease
classes, and MEG3 to four disease classes. These results
were consistent with the fact that many lncRNAs exhibit
tissue-specific expression [32] and that a few lncRNAs, such as
MEG3, XIST, and H19, are expressed across many tissues [33].

Prediction of lncRNAs implicated in diseases

We applied the propagation algorithm to predict the candidate
gene-disease associations on the coding-non-coding gene-disease
bipartite network. In this algorithm, there were two parameters to
be tuned: α and t. The parameter α gave the relative importance
of the information contributed by other genes versus the initial
information. This parameter was tuned by LOOCV tests, and
0.618 was chosen as our α based on this procedure. The
parameter t represented the number of iterations. The iterative
computation stopped when the mean square deviation of the
coding-non-coding gene-disease association score matrix between
the t-th iteration and the (t−1)-th iteration was not greater than
0.00001. With these two parameters, our algorithm ranked 2139
potential gene-disease pairs (768 lncRNA-disease pairs and 1371
coding gene-disease pairs) within the top 1% for all the diseases. In the
LOOCV procedure, our method achieved an AUC of 0.7881.

Robustness of our bipartite network 

We tested the robustness of the coding-non-coding gene-disease
bipartite network using the method of Multiple Survival Screening
(MSS) [34], which was introduced to test the robustness of
cancer-causing genes by re-sampling experiments. Here, we re-sampled
our coding-non-coding gene-disease associations 1000 times
to predict the potential gene-disease associations. In
each re-sampling experiment, we randomly removed 10% of the edges
from the coding-non-coding gene-disease bipartite network, and
then applied the propagation algorithm to predict the potential
gene-disease associations on the remaining bipartite network with
90% of the edges.

If a gene g was ranked within the top 1% among all the genes
according to the score vector for a given disease q, then the gene g
was predicted to be associated with the disease q, i.e. the
gene-disease pair (g,q) was considered a predicted association.
Applying the propagation algorithm to the full coding-non-coding
gene-disease bipartite network, we obtained 2139 predicted
associations. For a predicted association $(g_i, q_i)$ ($1 \le i \le 2139$), if
the rank of $g_i$ was within the top 1% in a re-sampling experiment,
then $n_i$ was increased by one. A vector $N = (n_1, n_2, \ldots, n_{2139})$ was
obtained, where $n_i \in [0, 1000]$ was the number of times the predicted
association $(g_i, q_i)$ was also predicted in the 1000 re-sampling
experiments. Furthermore, we performed 1000 random
experiments. In each experiment, we randomly shuffled the
coding-non-coding gene-disease bipartite network, while keeping the
degree of each gene and each disease in the bipartite network
unchanged as above, and then applied the propagation algorithm
to the randomized network. Similarly, a vector
$N^{r} = (n^{r}_1, n^{r}_2, \ldots, n^{r}_{2139})$ (r denotes random) was obtained. We
found that N was significantly larger than $N^{r}$ (p-value < $10^{-10}$,
z-test), with most of the $n_i$ larger than 700 and most of the $n^{r}_i$ smaller than
250 (Figure 4). These findings suggested that even when 10% of the edges
of the coding-non-coding gene-disease bipartite network were
deleted, the predictive results were still stable. Therefore, our
coding-non-coding gene-disease bipartite network was sufficiently
robust to predict potential coding or non-coding gene-disease
associations.

Leave-one-out cross-validation tests 

To evaluate the power of our method, we applied the LOOCV
procedure. In each test of LOOCV, a single gene-disease
association was removed from the coding-non-coding gene-disease
bipartite network, and the method was evaluated by its success in
reconstructing the hidden association. If the degree of the gene or
disease node in the removed gene-disease association was exactly
one, then the gene or disease would become an isolated node. An
isolated node could not receive any information in the propagation
algorithm, so we removed the nodes whose degree was one in
LOOCV. Finally, we kept 532 links between 103 diseases and 163
genes (44 lncRNAs and 119 protein-coding genes) that
were used in LOOCV. To the best of our knowledge, this
was the first work to predict potential lncRNA-disease
associations in a network view; therefore, no previous method
could be directly compared with ours. Instead, we compared
the predictive performance of the propagation algorithm on
different networks.

Figure 4. Comparison between re-sampling and random experiments to investigate the robustness of the bipartite network. Two
box plots represent the re-sampling experiments (left) and the random experiments (right). The plotted values are the number of times a predicted
association $(g_i, q_i)$ was also predicted in the re-sampling experiments and the random experiments.
doi:10.1371/journal.pone.0087797.g004

The receiver operating characteristic (ROC) curve, which plots the true
positive rate (TPR) versus the false positive rate (FPR) at different
rank thresholds, was used to measure the performance of our method.
In LOOCV, for a rank threshold k ($1 \le k \le 100$),
TPR was the percentage of the left-out associations
ranked within the top k%, and FPR was the percentage of
unassociated gene-disease pairs ranked within the top
k%. As the rank threshold was varied between 1 and 100, the
corresponding TPR and FPR values were obtained. In this way, the
ROC curve could be plotted, and the AUC could be calculated.
Following this procedure, we performed LOOCV over the lncRNA-disease
association network and achieved an AUC of 0.6820. The
ROC curve is shown in Figure 5.
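The rank-threshold ROC described above can be sketched as follows. The sketch assumes that each left-out association and each unassociated pair has already been converted to a rank percentile in (0, 100]; the function name, the trapezoidal integration, and the added threshold k = 0 (so that the curve starts at the origin) are choices made for this illustration and are not stated in the text.

```python
import numpy as np


def roc_from_rank_percentiles(held_out_pct, unassociated_pct):
    """ROC curve and AUC from rank percentiles.

    held_out_pct:     rank percentile of each left-out (true) association.
    unassociated_pct: rank percentile of each unassociated gene-disease pair.
    For each threshold k, TPR is the share of left-out associations ranked
    within the top k%, and FPR the share of unassociated pairs within the top k%.
    """
    held = np.asarray(held_out_pct, dtype=float)
    neg = np.asarray(unassociated_pct, dtype=float)
    thresholds = np.arange(0, 101)  # k = 0 anchors the curve at (0, 0)
    tpr = np.array([(held <= k).mean() for k in thresholds])
    fpr = np.array([(neg <= k).mean() for k in thresholds])
    # Trapezoidal rule over the (FPR, TPR) points.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc


# Hypothetical inputs: unassociated pairs spread evenly over all percentiles.
uniform = np.arange(1, 101)
_, _, auc_random = roc_from_rank_percentiles(uniform, uniform)
_, _, auc_perfect = roc_from_rank_percentiles(np.ones(20), uniform)
```

When true associations are ranked no better than unassociated pairs, the AUC of this sketch is 0.5, which matches the behavior expected of a random predictor.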

To improve the performance of our method, we
integrated the protein-coding gene-disease associations with the
lncRNA-disease associations to construct the coding-non-coding
gene-disease bipartite network. We then performed the LOOCV
procedure over the coding-non-coding gene-disease bipartite network
and obtained an AUC of 0.7881. The ROC curve is shown in Figure 5.
Clearly, the integration of protein-coding gene-disease associations
improved the performance of our method. One reason for the
improvement was that the number of edges in the bipartite
network was increased by the integration. Therefore, potential
genes could receive more information from other genes and diseases during
propagation and could be better predicted. The better performance
might also be attributed to the fact that coding and non-coding
genes cooperate in human diseases. Therefore, the
performance of our method is expected to improve further as
more lncRNA-disease associations, and more
associations between coding genes and non-coding genes, become known.

Moreover, we performed the LOOCV procedure over 50 random
networks. The mean FPR and mean TPR were used to plot the ROC
curve (Figure 5), giving an AUC of 0.5005, which is close to
chance level and smaller than the AUCs of the other two cases.
This indicated that our coding-non-coding gene-disease bipartite
network could reflect some mechanisms of human complex diseases,
and that our method could discover potential lncRNA-disease
associations.
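
The averaging of ROC curves over random networks can be illustrated with the short Python sketch below. It is not the original implementation. In particular, the randomization scheme shown here (shuffling the entries of the adjacency matrix so that only the number of edges is preserved) is an assumption made for the example, and both function names are hypothetical.

```python
import numpy as np

def randomize_bipartite(adj, rng):
    """Return a random gene-disease matrix with the same shape and edge count.

    Assumption: entries are shuffled uniformly, so node degrees are not
    preserved. The scheme actually used for the random networks may differ.
    """
    flat = adj.ravel().copy()
    rng.shuffle(flat)
    return flat.reshape(adj.shape)

def mean_roc(curves):
    """Average (fpr, tpr) curves sampled at the same rank thresholds."""
    mean_fpr = np.mean([fpr for fpr, _ in curves], axis=0)
    mean_tpr = np.mean([tpr for _, tpr in curves], axis=0)
    # Trapezoidal area under the averaged curve.
    auc = float(np.sum(np.diff(mean_fpr) * (mean_tpr[1:] + mean_tpr[:-1]) / 2.0))
    return mean_fpr, mean_tpr, auc
```

In this sketch, each random network would be scored with the same LOOCV procedure, and the resulting curves passed to `mean_roc` to obtain a single averaged curve and its AUC.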

Case study 

For each disease, all genes (both coding and non-coding) were
ranked according to their association scores with the disease.
The genes ranked within the top 10 (a user-defined threshold)
were considered potential genes involved in the given disease.
For all the 214 diseases in the coding-non-coding gene-disease
bipartite graph, we uncovered 768 novel lncRNA-disease
associations between 66 lncRNAs and 193 diseases.
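
The top-10 selection can be sketched in Python as follows. This is only an illustration under stated assumptions: the score matrix layout (genes in rows, diseases in columns), the boolean matrix of known associations and the function name are hypothetical, and the sketch assumes that known associations are excluded before ranking so that only novel candidates are returned.

```python
import numpy as np

def top_k_candidates(scores, known, gene_names, k=10):
    """List the k best-scoring genes per disease, skipping known associations.

    scores: float array of shape (n_genes, n_diseases) with association scores.
    known:  boolean array of the same shape, True for known associations.
    """
    result = {}
    for d in range(scores.shape[1]):
        # Mask known associations so that they cannot be selected.
        s = np.where(known[:, d], -np.inf, scores[:, d])
        # Sort by decreasing score and keep the first k genes.
        order = np.argsort(-s, kind="stable")[:k]
        result[d] = [gene_names[g] for g in order if np.isfinite(s[g])]
    return result
```

The threshold `k` plays the role of the user-defined cutoff mentioned above. Counting the lncRNAs that appear in the returned lists would give the number of novel lncRNA-disease associations under these assumptions.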

To further demonstrate the power of our method, we examined
the results for three multifactorial diseases as case studies:
Alzheimer's disease (MIM: 176807), pancreatic cancer
(MIM: 260350) and gastric cancer (MIM: 137215). For each case,
the top 10 genes, including protein-coding genes and lncRNAs,
are listed in Table 1.

Results for Alzheimer's disease. Alzheimer's disease (AD)
is the most common form of dementia in the elderly [35] and is
characterized by slow progressive loss of memory, cognitive
abilities, and intellectual functions [36]. Currently, 23 genes,
including 6 lncRNAs and 17 protein-coding genes, have been
reported to be associated with AD. The association scores of
these 23 genes were higher than those of unassociated genes.
Among the top-10 ranked genes not previously associated with AD,
lncRNA H19 ranked second and lncRNA PVT1 ranked third.
H19 had been associated with glioblastoma [37] and PVT1 had
been associated with glioma [24]. Both glioblastoma and glioma
Figure 5. A comparison between the performance of our propagation algorithm on the coding-non-coding gene-disease bipartite
network and that on the lncRNA-disease association network. The blue line represents the ROC curve of LOOCV over the coding-non-coding
gene-disease bipartite network, for which an AUC of 0.7881 was obtained. The cyan line represents the ROC curve of LOOCV over the lncRNA-disease
association network, for which an AUC of 0.6820 was obtained. The light blue line represents the ROC curve of LOOCV over random networks, for which
an average AUC of 0.5005 was obtained.
doi:10.1371/journal.pone.0087797.g005



Table 1. The top-10 ranked genes for three case studies.

Alzheimer's disease
gene       ACC/MIM      Rank     gene       ACC/MIM      Rank
COL4A2     120090       1        IL1RN      147679       6
H19        NR_002196    2        EPO        133170       7
PVT1       NR_003367    3        SOD2       147460       8
ALOX5AP    603700       4        VEGF       192240       9
PON1       168820       5        F2         176930       10

Pancreatic cancer
gene       ACC/MIM      Rank     gene       ACC/MIM      Rank
H19        NR_002196    1        FGFR3      134934       6
ANRIL      NR_003529    2        UCA1       NR_015379    7
BC200      NR_001568    3        PIK3CA     171834       8
MEG3       NR_002766    4        CDH1       192090       9
XIST       NR_001564    5        SRA        AF092038     10

Gastric cancer
gene       ACC/MIM      Rank     gene       ACC/MIM      Rank
XIST       NR_001564    1        ANRIL      NR_003529    6
MALAT-1    NR_002819    2        TP53       191170       7
MEG3       NR_002766    3        FGFR3      134934       8
PVT1       NR_003367    4        PTEN       601728       9
BRCA2      600185       5        RAD54L     603615       10

In this table, the susceptibility genes (protein-coding genes and lncRNAs) for the three case studies (Alzheimer's disease, pancreatic cancer and gastric cancer) are
listed. lncRNAs are listed with an accession number (ACC) and protein-coding genes with a MIM number.
doi:10.1371/journal.pone.0087797.t001
were brain or neuron related diseases, and AD was described as a
neurological disease. Together, these observations suggested a
relationship between these two lncRNAs and AD.

Results for Pancreatic cancer. Pancreatic cancer has a
high mortality rate, with a 5-year relative survival rate of less
than 5% [38]. It has previously been shown that 18 genes,
including 5 lncRNAs and 13 protein-coding genes, are implicated
in pancreatic cancer. The association scores of these 18 genes
were also higher than those of unassociated genes. Among the
top-10 ranked genes not previously associated with pancreatic
cancer, ANRIL ranked second. Pasmant et al. [5] confirmed the
pivotal role of ANRIL in the regulation of CDKN2A/B expression
through a cis-acting mechanism in mice, and ANRIL has been
implicated in proliferation and senescence. Furthermore, the
association of CDKN2A (MIM: 600160) with pancreatic cancer had
been curated in OMIM [11]. UCA1 ranked seventh, and Kaneko et al.
[39] showed that UCA1 and BMF were upregulated in gallbladder
epithelia of children with pancreaticobiliary maljunction.
Therefore, our results suggested that UCA1 might be associated
with pancreatic disease.

Results for Gastric cancer. Gastric cancer is a
high-morbidity cancer whose morbidity varies across populations
[40]. It has been reported that 15 genes, including 4 lncRNAs and
11 protein-coding genes, are implicated in gastric cancer. The
association scores of these 15 genes were higher than those of
unassociated genes. Among the potential genes implicated in
gastric cancer, XIST ranked first. The study of Weakley et al.
[41] showed that XIST was differentially expressed in
preneoplastic cells located in the gastric fundus that could lead
to gastric cancer. MEG3 ranked third, and MEG3 was reported to
function as a novel lncRNA tumor suppressor [42].

Discussion 

We constructed the lncRNA-disease association network, from
which two relevant networks, lncDN and DlncN, were generated.
These networks provided a unified framework of all known
lncRNA-disease associations and a new network view for the
study of these associations. The detailed lncRNA-disease
association network (Figure S1) showed all the known
lncRNA-disease associations. Furthermore, a computational
iterative algorithm was applied to mine the hidden
lncRNA-disease associations. The results showed that our method
could provide insightful suggestions of lncRNAs implicated in
diseases.

Our method had some limitations that should be acknowledged.
First, the analysis of lncRNA function on a genome-wide scale
was limited by the diversity and expression specificity of
lncRNAs, by the limited knowledge about them, and by the lack of
lncRNA functional annotation. Second, the shortage of known
lncRNA-disease associations limited the analysis of the
mechanisms by which lncRNAs are implicated in disease on a
larger network. Finally, because interactions and similarities
between non-coding genes and protein-coding genes were lacking,
replacing the gene similarity matrix in Formula (4) with the
weighted gene projection W was not sufficiently meaningful
biologically.

Supporting Information 

Table S1 20 tests on random networks to show that the
propagation method converges. We did 20 tests. In test i

Figure S1 Bipartite-graph representation of the
lncRNA-disease association network. A disease (circle) and
a lncRNA (hexagon) are connected if the lncRNA is implicated in
the disease. The size of a node is proportional to the degree of the

Figure S2 Degree distribution of full lncRNA-disease
association network. (a) The degree distribution of the full
lncRNA-disease association network. It closely follows a power-law
distribution. Here, k represents degree, p(k) denotes the fraction
of nodes with a given degree k. (b) Degree distribution of disease
nodes in lncRNA-disease association network. (c) Degree
distribution of lncRNA nodes in lncRNA-disease association network.
(TIF)

Figure S3 Degree distribution of lncDN and DlncN. (a)
Degree distribution of lncDN. It closely follows a power-law
distribution. Here, k represents degree, p(k) denotes the fraction
of nodes with a degree k. (b) Degree distribution of DlncN. It
closely follows a power-law distribution. Here, k represents degree,
p(k) denotes the fraction of nodes with a degree k.
(TIF)

Figure S4 Degree distributions of average clustering
coefficients of nodes in lncDN and DlncN. (a) Degree
distribution of average clustering coefficients of nodes in lncDN.
(b) Degree distribution of average clustering coefficients of nodes
in DlncN. Both distributions closely follow a power-law

DlncN. (a) The module sizes of 14 modules in lncDN. (b) The
(TIF)